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Fast forward feature selection for the nonlinear 
classification of hyperspectral images 
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Abstract 

A fast forward feature selection algorithm is presented in this paper. It is based on a Gaussian mixture model 
(GMM) classifier. GMM are used for classifying hyperspectral images. The algorithm selects iteratively spectral 
features that maximizes an estimation of the classification rate. The estimation is done using the k-fold cross 
validation. In order to perform fast in terms of computing time, an efficient implementation is proposed. First, the 
GMM can be updated when the estimation of the classification rate is computed, rather than re-estimate the full 
model. Secondly, using marginalization of the GMM, sub models can be directly obtained from the full model 
learned with all the spectral features. Experimental results for two real hyperspectral data sets show that the method 
performs very well in terms of classification accuracy and processing time. Furthermore, the extracted model contains 
very few spectral channels. 
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Fast forward feature selection for the nonlinear 
classification of hyperspectral images 

I. Introduction 

Since the pioneer paper of J. Jimenez and D. Landgrebe 01, it is well known that hyperspectral images need 
specific processing techniques because conventional ones made for multispectral/panchromatic images do not adapt 
well to hyperspectral images. Generally speaking, the increasing number of spectral channels poses theoretical and 
practical problems Q. In particular, for the purpose of pixel classification, the spectral dimension needs to be 
handle carefully because of the “Hughes phenomenon” ||3l : with a limited training set, beyond a certain number of 
spectral features, a reliable estimation of the model parameters is not possible. 

Many works have been published since the 2000s to address the problem of classifying hyperspectral images. A 
non-exhaustive list should include techniques from the machine learning theory (Support Vector Machines, Random 
Forest, neural networks) ||4l, statistical models ||T1 and dimension reduction Q. SVM, and kernel methods in general, 
have shown remarkable performances on hyperspectral data in terms of classification accuracy Q. However, these 
methods may suffer from a high computational load and the interpretation of the model is usually not trivial. 

In parallel to the emergence of kernel methods, the reduction of the spectral dimension has received a lot of 
attention. According to the absence or presence of training set, the dimension reduction can be unsupervised or 
supervised. The former try to describe the data with a lower number of features that minimize a reconstruction error 
measure, while the latter try to extract features that maximize the separability of the classes. One of the most used 
unsupervised feature extraction method is the principal component analysis (PCA) 111. But it has been demonstrated 
that PCA is not optimal for the purpose of classification Q. Supervised method, such as the Fisher discriminant 
analysis or the non-weighted feature extraction have shown to perform better for the purpose of classification. 
Other feature extraction techniques, such as independent component analysis l[8l, have been applied successfully 
and demonstrate that even SVM can benefits from feature reduction l|9l, ifTOl . However, conventional supervised 
techniques suffer from similar problems than classification algorithms in high dimensional space. 

Rather than supervised and unsupervised techniques, one can also distinguish dimension reduction techniques into 
feature extraction and feature selection. Feature extraction returns a linear/nonlinear combination of the original 
features, while feature selection returns a subset of the original features. While feature extraction and feature 
selection both reduce the dimensionality of the data, the latter is much more interpretable for the end-user. 
The extracted subset corresponds to the most important features for the classification, i.e., the most important 
wavelengths. For some applications, these spectral channels can be used to infer mineralogical and chemical 
properties m. 

Feature selection techniques generally need a criterion, that evaluates how the model built with a given subset of 
features performs, and an optimization procedure that tries to find the subset of features that maximizes/minimizes 
the criterion ifT^ . Several methods have been proposed according to that setting. For instance, an entropy measure 
and a genetic algorithm have been proposed in ifT^ Chapter 9], but the band selection was done independently 
of the classifier, i.e., the criterion was not directly related to the classification accuracy. Jeffries Matusita (JM) 
distance and steepest-ascent like algorithms were proposed in l\14\l . The method starts with a conventional sequential 
forward selection algorithm, then the obtained set of features is updated using local search. The method has been 
extended to take into account spatial variability between features in snu where a multiobjective criterion was 
used to take into account the class separability and the spatial variability. JM distance and exhaustive search 
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as well as some refinement techniques have been proposed also in hi2'^ . However rather than extracting spectral 
features, the algorithm returns the average over a certain bandwidth of contiguous channels, which can make the 
interpretation difficult and often leads to select a large part of the electromagnetic spectrum. Similarly, spectral 
intervals selection was proposed in m, where the criterion used was the square representation error (square 
error between the approximate spectra and the original spectra) and the optimization problem was solved using 
dynamic programming. These two methods reduce the dimensionality of the data, but cannot be used to extract 
spectral variables. Recently, forward selection and genetic algorithm driven hy the classification error minimization 
has been proposed in ifTTl . 

Feature selection has been also proposed for kernel methods. A recursive scheme used to remove features that 
exhibit few influence on the decision function of a nonlinear SVM was discussed in hI8\l . Alternatively, a shrinkage 
method based on ii-norm and linear SVM has been investigated by Tuia et al. / Ii9l/ . The authors proposed a method 
where the features are extracted during the training process. However, to make the method tractable in terms of 
computational load, a linear model is used for the classification, that can limit the discriminating power of the 
classifier. In / I20I/ . a dependence measure between spectral features and thematic classes is proposed using kernel 
evaluation. The measure has the advantage to apply to multiclass problem making the interpretation of the extracted 
features easier. 

Feature selection usually provides good results in terms of classification accuracy. However, several drawbacks 
can be identified from fhe above menfioned liferafure: 

• If can be very fime consuming, in particular when nonlinear classificalion models are used. 

• When linear models are used for fhe selection of fealures, performances in terms of classification accuracy 
are not satisfying and therefore another nonlinear classifier should be used after the feature extraction. 

• For multiclass problem, it is sometimes difficult to interpret the extracted features when a collection of binary 
classifiers is used (e.g., SVM). 


In this work, it is proposed to use a forward strategy, based on ll^ . that uses an efficient implementation scheme 
and allows to process a large amount of data, both in terms of number of samples and variables. The method, 
called nonlinear parsimonious feature selection (NPFS), selects iteratively a spectral feature from the original set 
of features and adds it to a pool of selected features. This pool is used to learn a Gaussian mixture model (GMM) 
and each feature is selected according to a classification rate. The iteration stops when the increased in terms of 
classification rate is lower than a user defined fhreshold or when fhe maximum number of feafures is reached. In 
comparison fo ofher fealure exfracfion algorifhms, fhe main confribufions of NPFS is fhe abilify fo select spectral 
features through a nonlinear classification model and its high computational efficiency. Furthermore, NPFS usually 
extracts a very few number of features (lower than 5 % of the original number of spectral features). 

The remaining of the paper is organized as follows. Section [^presents the algorithm with the Gaussian mixture 
model and the efficient implementation. Experimental results on three hyperspectral data sets are presented and 


discussed in Section III Conclusions and perspectives conclude the paper in Section IV 


H. Non linear parsimonious feature selection 

The following notations are used in the remaining of the paper. S = denotes the set of training 

pixels, where Xj G is a d-dimensional pixel vector, yj = 1,..., C is its corresponding class, C the number of 
classes, n the total number of training pixels and ric the number of training pixels in class c. 
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A. Gaussian mixture model 


For a Gaussian mixture model, it is supposed that the observed pixel is a realization of a d-dimensional random 
vector such as 


c 

f(x) = ^7rcp(x|c), 

c=l 


( 1 ) 


where ttc is the proportion of class c (0 < ttc < 1 and Yl^=i^c = 1) and p(x|c) is a d-dimensional Gaussian 
distribution, i.e., 

= (2^)d/2|S,|l/2 """P " /^c)'^5]-l(x - . 

with /Xg being the mean vector of class c, X)c being the covariance matrix of class c and |5]c| its determinant. 
Following the maximum a posteriori rule, a given pixel is classified to the class c if p(c|x) > p(/c|x) for all 
A; = 1,..., C. Using the Bayes formula, the posterior probability can be written as 


^ I ^ vrcp(x|c) 

p(c|x) = —-—. 

Ilk=l^kP{^\k) 

Therefore, the maximum a posteriori rule can be written as 


( 2 ) 


X belongs to c c = arg max 7rfcp(x|fc). (3) 

k=l,...,C 

By taking the log of eq. Q the final decision rule is obtained (also known as quadratic discriminant function) 


Qc(x) = -(x - ^(x - //J - ln(|5]c|) -h 21n(7rc 


(4) 


Using standard maximization of the log-likelihood, the estimator of the model parameters are given by 


TTn = 


nc 

n 

1 


i^c = 

n 


X,:, 


Sc = — 2(Xi -/iJ(Xi 


i=l 


(5) 

( 6 ) 

(7) 


with ric is the number of sample of class c. 

For GMM, the “Flughes phenomenon” is related to the estimation of the covariance matrix. If the number of 
training samples is not sufficient for a good estimation the computation of the inverse and of the determinant in 
eq.Q will be very numerically unstable, leading to poor classification accuracy. For instance for the covariance 
matrix, the number of parameters to estimate is equal to d(d -|- l)/2: if d = 100 then 5050 parameters have to be 
estimated then the minimum number of training samples for the considered class should be at least 5050. Note in 
that case the estimation will be possible but not accurate. Feature selection tackles this problem by allowing the 
construction of GMM with a reduced number p of variables, with p « d and p{p + l)/2 < ric- 


B. Forward feature selection 

The forward feature selection works as follow ll22l Chapter 3]. It starts with an empty pool of selected features. 
At each step, the feature that most improves an estimation of the classification rate is added to the pool. The 
algorithm stops either if the increase of the estimated classification rate is too low or if the maximum number of 
features is reached. 
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The fe-fold cross-validation (k-CV) is used in this work to estimate the classification rate. To compute the fc-CV, 
a subset is removed from S and the GMM is learned with the remaining training samples. A test error is computed 
with the removed training samples used as validation samples. The process is iterated k times and the estimated 
classification rate is computed as the mean test error over the k subsets of S. 

The efficient implementation of the NPFS relies on a fast estimation of the parameters of the GMM when 
the k-CY is computed. In the following, it will be shown that by using update rules of the parameters and the 
marginalization properties of the Gaussian distribution, it is possible to perform k-CY and forward selection quickly. 
As a consequence, the GMM model is learned only one time during the whole training step. The algorithm [T] presents 
a pseudo code of the proposed method. 

1) Fast estimation of the model on : In this subsection, it is shown that each parameter can be easily 
updated when a subset is taken off S. 

Proposition 1 (Proportion): The update rule for the proportion is 


TT 


n—u 

c 


nfr" - Vc 
n — V 


( 8 ) 


where and if” are the proportions of class c computed over n — u and n respectively, u is the number of 

removed samples from S, Uc is the number of removed samples from class c such as 
Proposition 2 (Mean vector): The update rule for the mean vector is 

Uc - Uc 

where and the mean vectors of class c computed over the Uc and Hc—Uc training samples respectively, 

is the mean vector of the Uc removed samples from class c. 

Proposition 3 (Covariance matrix): The update rule for the covariance matrix is 





ricUc 

(uc - Uc)"^ 


(A 


Tie 

C 


(AJ 



( 10 ) 


Tic '' Tlc — l^c 

where 5]^ and are the covariance matrices of class c computed over the Uc and Uc — Uc training samples 

respectively. 

2 ) Particular case of leave-one-out cross-validation: When very few training samples are available, it is some¬ 
times necessary to resort to leave-one-out cross-validation (k = n). Updates rules are still valid, but it is also 
possible to get a fast update of the decision function. If the removed sample does not belong to class c, only the 
proportion term in eq. Q change, therefore the updated decision rule can be written as: 

+21n(^). (11) 

where and are the decision rules for class c computed with Uc and nc — 1 samples respectively. If the 

removed sample x^, belongs to class c then updates rules become: 

Proposition 4 (Proportion-loocv): 


Proposition 5 (Mean vector-loocv): 


nK - 1 

n — 1 


ncAc° 


Xr. 


( 12 ) 


nc - 1 


( 13 ) 
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Algorithm 1 NPFS pseudo code 

Require: 5, k, delta, maxvariable 
1: Randomly cut S into k subsets such as Si U ... U Sk = S and Si Pi = 0 
2: Learn the full GMM with S 

3: Initialize the set of selected variables ifia to empty set (|v3s| = 0) and available variables ifia to the original set of variables (|ipa| = d) 
4: while |(ps| < maxvariable do 

5: for all 5^ C 5 do 

6: Update the model using eq. and l |10| ( (or their loocv counterparts) according to Su 

7: for all s C do 

8: Compute the classification rate on Su for each set of variables i/Js H s using the marginalization properties 

9: end for 

10: end for 

11: Average the classification rate over the fc-fold 

12: if Improvement in terms of classification rate w.r.t. previous iteration is lower than delta then 

13: break 

14: else 

15: Add the variable s corresponding to the maximum classification rate to ips and remove it from pa 

16: end if 

17: end while 


Proposition 6 (Covariance matrix-loocv): 


,nc— 1 


nc 


{x„ - A?-) (X„ - Alf 


{nc - 1 )^ 


(14) 


where nc — 1 denotes that the estimation is done with only nc — 1 samples rather than the ric samples of the class. 

An update rule for the case where the sample belongs the class c can he written hy using the Cholesky 
decomposition of the covariance matrix and rank-one downdate, hut the downdate step is not numerically stable 
and not used here. 

3) Marginalization of Gaussian distribution: To get the GMM model over a subset of the original set of features, 
it is only necessary to drop the non-selected features from the mean vector and the covariance matrix |[23l . For 
instance, let x = where x^ and x„s are the selected variables and the non-selected variables respectively, 

the mean vector can be written as 


A = [^6, 


Mns] 


T 


(15) 


and the covariance matrix as 


S = 


^s,s ^s,ns 
^ns,s ^ns,ns 


(16) 


The marginalization over the non-selected variables shows that x^ is also a Gaussian distribution with mean vector 
fig and covariance matrix 5]s,s- Hence, once the full model is learned, all the sub-models built with a subset of the 
original variables are available at no computational cost. 


III. Experimental results 

A. Data 

Two data sets have been used in the experiments. The first data set has been acquired in the region surrounding 
the volcano Hekla in Iceland by the AVIRIS sensor. 157 spectral channels from 400 to 1,840 nm were recorded. 
12 classes have been defined for a total of 10,227 referenced pixels. The second data set has been acquired by the 
ROSIS sensor during a flight campaign over Pavia, nothern Italy. 103 spectral channels were recorded from 430 to 
860 nm. 9 classes have been defined for a total of 42776 referenced pixels. 
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TABLE I 

Classification accuracies for Hekla data set. The results correspond to the mean value and variance of the 
OVERALL ACCURACY OVER THE 50 REPETITIONS. THE BEST RESULT FOR EACH TRAINING SETUP IS REPORTED IN BOLD FACE. n-NPFS 
AND 5-NPFS CORRESPOND TO THE NPFS COMPUTED WITH THE LEAVE-ONE-OUT AND 5-FOLD CROSS-VALIDATION, RESPECTIVELY. 
RFF, SVM^J AND SVMf^ CORRESPOND TO THE RECURSIVE FEATURE EXTRACTION SVM, THE LINEAR SVM WITH CONSTRAINT 
AND THE LINEAR SVM WITH WITH THE EXPLICIT ORDER 2 POLYNOMIAL FEATURE SPACE, RESPECTIVELY. SVMpoly AND SVMgauss 
CORRESPOND TO THE CONVENTIONAL NONLINEAR SVM WITH A ORDER 2 POLYNOMIAL KERNEL AND GAUSSIAN KERNEL, 

RESPECTIVELY. 


Uc 

n-NPFS 

5-NPFS 

RFF 

SVM,i 

SVM?^ 

SVMpoly 

SVMgaus. 

50 

92.5 ± 1.2 

92.4 ± 1.2 

90.2 ± 1.8 

90.3 ± 1.0 

91.6 ± 0.6 

84.6 ± 1.6 

90.4 ± 1.6 

too 

94.8 ± 0.7 

94.6 ± 0.6 

95.6 ± 0.3 

93.9 ± 0.5 

94.8 ± 0.1 

91.4 ± 0.4 

95.6 ± 0.3 

200 

95.9 ± 0.3 

95.8 ± 0.3 

96.8 ± 1.1 

95.6 ± 0.1 

96.3 ± 0.1 

95.5 ± 0.1 

96.8 ± 1.1 


TABLF II 

Classification accuracies for University of Pavia data set. Same notations than in TableUI 


ric 

n-NPFS 

5-NPFS 

RFF 

SVM^i 

SWMl 

SVMpoly 

SVMgaua. 

50 

82.2 ± 4.4 

83.4 ± 7.6 

84.7 ± 4.0 

75.1 ± 2.5 

81.0 ± 2.8 

82.9 ± 3.4 

84.8 ± 3.4 

100 

86.3 ± 3.2 

85.9 ± 3.1 

88.4 ± 0.9 

77.3 ± 1.4 

83.6 ± 1.3 

86.5 ± 1.6 

88.4 ± 1.4 

200 

87.7 ± 3.1 

87.9 ± 1.9 

90.8 ± 0.3 

78.5 ± 0.7 

85.5 ± 0.4 

88.8 ± 0.6 

90.8 ± 0.3 


For each data set, 50, 100 and 200 training pixels per class were randomly selected and the remaining referenced 
pixels were used for the validation. 50 repetitions were done for which a new training set have been generated 
randomly. 

B. Competitive methods 

Several conventional feature selection methods have been used as baseline. 

• Recursive Feature Elimination (RFE) for nonlinear SVM ifTSl . In the experiment, a Gaussian kernel was used. 

• Einear SVM with (SVM^J constraint on the feature vector ifT^ based on the EIBEINEAR implementa¬ 
tion ll2^ . 

• To overcome the limitation of the linear model used in EIBEINEAR, an explicit computation of order 2 poly¬ 
nomial feature space has been used together with EIBEINEAR (SVM^ ). Eormally, a nonlinear transformation 
(j) has been apply on the original samples: 

X = [xi, . . . , Xrf] FA 4>{x) = [xi, ...,Xd, xl,XlX2, . . . , XlXd, X2, X2X3, . . . , X^] 

with p = Eor Hekla data and University of Pavia data, the dimension p of the projected space is 12561 

and 5460, respectively. 

Eor comparison, a SVM with a Gaussian kernel and a order 2 polynomial kernel classifier, based on the 
EIBSVM II 25 II . with all the variables have been used too. 

Eor the linear/nonlinear SVM, the penalty parameter and the kernel hyperparameters were selected using 5-fold 
cross-validation. Eor NPES, the threshold (delta in Algorithm^ was set to 0.5% and the maximum number of 
extracted features was set to 20. The estimation of the error has been computed with a leave-one-out CV (n-NPES) 
and a 5-fold CV (5-NPES). Each variable has been standardized before the processing (i.e., zero mean and unit 
variance). 
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I- 1 n-NPFS 

I- 1 5-NPFS 

^ SYMe, 
r^RFE 


Number of training samples per class 


Fig. 1. Mean number Us of selected features for the different methods for Hekla data set. The red line indicates the original number of 
spectral features. Projected ii SVM is not reported because the mean number of extracted features was too high (e.g., 6531 for nc=50). 


CO 
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I- 1 n-NPFS 

I- 1 5-NPFS 

^ SYMe, 
E^RFE 


Number of training samples per class 


Fig. 2. Mean number fis of selected features for the different methods for University of Pavia data set. The red line indicates the original 
number of spectral features. Projected h SVM is not reported because the mean number of extracted features was too high (e.g., 5110 for 
nc=50). 


C. Results 

The mean accuracies and the variance over the 50 runs are reported in Table 1^ and Table The mean numbers 
of extracted features for the different methods are reported in Figure [T] and Figure 

From the tables, it can be seen that there is no difference in the results obtained with n-NPFS or 5-NPFS. They 
perform equally on both data sets in terms of classification accuracy or number of extracted features. However, 
5-NPFS is much faster in terms of computation time. 

RFE and SVMgauss provide the best results in terms of classification accuracy, except for the Hekla data set and 
Uc = 50. From the figure, it can be seen that the number of extracted features is almost equal to original number 
of spectral features, meaning that in these experiments RFE is equivalent to SVMgauss- Hence, RFE was not able 
to extract /ew relevant spectral features. 

£i SVM applied on the original features or the projected features is not able to extract relevant features. In terms 
of classification accuracy, the linear SVM does not perform well for the University of Pavia data set. Nonlinear £i 
SVM provides much better results for both data sets. In comparison to the non sparse nonlinear SVM computed 
with an order 2 polynomial kernel, ii nonlinear SVM performs better in terms of classification accuracy for the 
Hekla data while it performs worst for the University of Pavia data. 

In terms of number of extracted features, NPES provides the best results, by far, with an average number of 
extracted features equal to 5% of the original number. All the others methods were not able to extract few features 






























Fig. 3. Classification rate in function of the number of extracted features. Continuous line corresponds to 5-NPFS, dashed line to SVM 
with a Gaussian kernel and dash-doted line to a linear SVM. 

TABLE III 

Mean processing time in second in function of the number of samples per class for the University of Pavia data set. 20 repetitions have 
been done on laptop with 8Gb of RAM and Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz processor. 


Us 

50 

100 

200 

400 

SVMgauss 

11 

40 

140 

505 

SVM^, 

52 

115 

234 

498 

n-NPFS 

242 

310 

472 

883 

5-NPFS 

35 

31 

29 

43 


without decreasing drastically the overall accuracy. For instance, for the Hekla data set and ric = 50, only 7 spectral 
features are used to huild the GMM and leads to the best classification accuracy. A discussion on the extracted 
features is given in the next subsection. 

The figure^presents the averaged classification rate of 5-NPFS, SVM with a Gaussian kernel and a linear SVM 
applied on the selected features with 5-NPFS. 20 repetitions have been done on the University data set with nc—50. 
The optimal parameters for SVM and linear SVM have been selected using 5-fold cross-validation. From the figure, 
it can be seen that the three algorithms have similar trends. When the number of features is relatively low (here 
lower than 15) GMM performs the best, but when the number of features increases too much, SVM (non linear and 
linear) performs better in terms of classification accuracy. It is worth noting that such observations are coherent 
with the literature: SVM are known to perform well in high dimensional space, while GMM is more affected by 
the dimension. Yet, NPFS is able to select relevant spectral variables, for itself, or for other (possibly) stronger 
algorithms. 

The mean processing time for the University of Pavia data set for several training set sizes is reported in 


Table III It includes parameter optimization for SVM gauss and SVMi^. Note that the RFE consists in several 
SVM gauss optimization, one for each feature removed (hence, if 3 features are removed, the mean processing time 
is approximately multiply by 3). It can be seen that the 5-NPFS method is a little influenced by the size of the 
training set: what is important is the number of (extracted) variables. For rig = 50, the processing time is slightly 
higher because of overload due to parallelization procedure. n-NPFS is the more demanding in terms of processing 
time and thus should be used only when the number of training samples is very limited. Finally, it is important to 
underline that the NPFS is implemented in Python while SVM used a state of the art implementation in C++ / I25I/ . 

From these experiments, and from a practical viewpoint, NPFS is a good compromise between high classification 
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Fig. 4. The dotted line and the crossed line represent the mean error rate and the mean number of selected features, respectively, as a 
function of delta. The simulation was done on the University of Pavia data set, with ric = 50 and for the 5-NPFS algorithm. 


accuracy and sparse modeling. 

D. Discussion 

The extracted channels hy 5-NPFS and n-NPFS were compared for one training set of the University of Pavia 
data set: two channels were the same for both methods, 780nm and 776nm; two channels were very close, 555nm 
and 847nm for 5-NPFS and 551nm and 855nm for n-NPFS; one channel was close, 521nm for 5-NPFS and 501nm 
for n-NPFS. The other channel selected hy n-NPFS is 772nm. If the process is repeated, the result is terms of 
selected features hy n-NPFS and 5-NPFS is similar: on average 35% of the selected features are identical (not 
necessarily the first ones) and the others selected features are close in terms of wavelength. 

The influence of the parameter delta has been investigated on the University of Pavia data set. 20 repetitions 
have been done with ric = bO for several values of delta. Results are reported on figure^ From the figure, 
it can be seen that when delta is set to a value larger than approximately 1%, the algorithm stops too early 
and the number of selected features is too low to perform well. Conversely, setting delta to a small value does 
not change the classification rate, a plateau being reached for delta lower than 0.5%. In fact, because of the 
“Hughes phenomenon”, adding spectral features to the GMM will first lead to an increase of the classification rate 
but then (after a possible plateau) the classification rate will decrease, i.e., the improvement after two iterations of 
the algorithm will be negative. 

Figure 1^ presents the most selected features for the University of Pavia data set. 1000 random repetitions have 
been done with nc-200 and the features shaded in the figure have been selecfed at least 10% times (i.e., 100 times 
over 1000) using 5-NPFS. Five spectral domains can be identified, fwo from fhe visible range and fhree from the 
near infrared range. In particular, it can be seen that spectral channels from the red-edge part are selected. The 
width of the spectral domain indicates the variability of the selection. The high correlation between adjacent spectral 
bands makes the variable selection “unstable”, e.g., for a given training set, the channel t would be selected but 
for another randomly selected training set it might be the channel f -|- 1 or f — 1. It is clearly a limitation of the 
proposed approach. 

To conclude this discussion, similar spectral channels are extracted with n-NPFS and 5-NPFS, while the latter 
is much times faster. Hence, n-NPFS should be only used when very limited number of samples is available. A 
certain variability is observed in the selection of the spectral channels due to the high correlation of adjacent spectral 
channels and the step-wise nature of the method. 
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Fig. 5. Most selected spectral domain for the University of Pavia data set. Gray bars correspond to the most selected parts of the spectral 
domain. Horizontal axis corresponds to the wavelength (in nanometers). The mean value of each class is represented in continuous colored 
lines. 


IV. Conclusion 

A nonlinear parsimonious feature selection algorithm for the classification of hyperspectral images and the 
selection of spectral variables has been presented. Using a Gaussian mixture model classifier, specfral variables are 
extracted iteratively based on the cross-validation estimate of the classification rate. An efficient implementation is 
proposed that takes into account some properties of Gaussian mixture model: a fast update of the model parameters 
and a fast access to the sub-models. Experimental results show that the proposed method is able to select few 
relevant features, and outperform standard SVM-based sparse algorithms while reaching similar classification rates 
to those obtained with SVM. Furthermore, in comparison to SVM based feature selection algorithm, multiclass 
problem is handled by the GMM making the interpretation of the extracted channels easier. 

More investigation are needed to fully understand which features are extracted, since the method is purely 
statistical. If the red-edge has been identified, the others extracted features are not clearly interpretable. Moreover, 
small variability has been observed due to the high correlation between adjacent bands and the step-wise procedure. 
To overcome this limitation, a continuous interval selection strategy, as in lIT^ . will be investigated. Also, a steepest- 
ascent search strategy could be used to make the final solution more stable. 

The python code of the algorithm is available freely for download: https://github.com/mfauvel/FFFS. 
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